by

Ruby  Bei-Loh  Lee 
ABSTRACT 

A general model of computation on a p-parallel processor is proposed, distinguishing clearly between the logical parallelism (p* processes) inherent in a computation and the physical parallelism (p processors) available in the computer organization. This shows the dependence of performance bounds on both the computation being executed and the computer architecture. We formally derive necessary and sufficient conditions for the maximum attainable speedup of a p-parallel processor over a uniprocessor to be $S_p \le \min\left(\frac{p}{\ln p}, \frac{p^*}{\ln p^*}\right)$, where $\ln p$ approximates $H_p$, the pth harmonic number. We also verify that empirically-derived speedups are $O\left(\frac{p^*}{\ln p^*}\right)$. Finally, we discuss related performance measures of minimum execution time, maximum efficiency and minimum space-time product.


The  work  described  herein  was  supported  by  the  Joint  Services  Electronics 
Program  under  Contract  No.  N00014-75-C-0601 . 
1.  INTRODUCTION


The  purpose  of  this  paper  is  to  estimate  the  maximum  attainable  speedup 
of  a computation,  executing  on  a computer  organization,  within  a given  technology. 
The  questions  we  ask  are:  What  is  the  minimum  execution  time,  and  hence  the 

maximum  speedup,  of  a given  computation,  assuming  unlimited  computer  resources? 

Then,  given  a particular  type  of  computer  organization,  we  ask  what  aspects  of 
a computation  affect  the  raw  speed  of  this  computer  and  to  what  extent? 

The  computer  organization  we  wish  to  consider  is  one  where  p identical 
processors  operate  in  parallel  on  different  instructions.  This  belongs  to  the 
general  class  of  MIMD  (Multiple  Instruction  Multiple  Data)  organizations  [1]. 

We wish to study the speedup attainable by varying the number, p, of processors used. The basis of comparison is always that of the uniprocessor, where p=1.

Since  the  parameter  of  concern  is  the  number  of  processors,  p,  we  will  assume 
that  there  are  unlimited  supplies  of  memories  and  input-output  devices  so  that 
the  computation  is  always  processor-bound,  with  no  delays  due  to  memory  faults 
and  interference,  or  device  interrupts. 

There  are  many  motivations  for  considering  the  parallel  processor  organization. 
The  first  is  that  it  has  immediate  practical  appeal.  The  rapidly  decreasing  cost 
of  LSI  microprocessors  makes  it  economically  feasible  to  consider  using  a whole 
army of processors within the computer architecture to speed up a computation,
even at reduced efficiency of each component processor. No longer is the processor
the hallowed CPU (Central Processing Unit), or the most valuable resource which
has to be utilized with the greatest efficiency. Of course, an acceptable level
of  efficiency  has  to  be  obtained,  but  more  importantly,  we  need  to  find  out  what 
sort  of  speedup  is  in  fact  possible  by  increasing  the  number  of  processors,  even 
if  we  ignore  the  problems  of  control  and  communication  which  must  accompany  the 
cooperation  and  competition  between  these  processors.  This  brings  us  to  our 
second  basic  motivation  for  studying  the  parallel  processor  organization. 

Controversial  Views 

In  fact,  quite  a lot  of  controversy  and  "folklore"  has  built  up  around  the 
issue  of  speedup  bounds  for  parallel  processors.  It  is  clear  that  the  nature 
of  the  computation  will  limit  the  maximum  performance  of  the  parallel  processors, 
but  to  what  extent? 


Amdahl [8] suggested that the amount of strictly serial operations inherent in a computation points to a uniprocessor approach to computing, if some "acceptable" level of efficiency is to be achieved. The model that he used, however, was very simple: the number of processors that could be used at any time was either 1 or p. In this paper, we take the more general approach where any number of processors between 1 and the maximum number p may be used simultaneously.

Another view, which has come to be known as "Minsky's Conjecture", suggests that the speedup is proportional to $\log_2 p$ in most cases [6,7]. Flynn [1] has supported this view for SIMD (Single Instruction Multiple Data) organizations, and has proposed an explanation for it based on a special kind of nested branching degradation in the program. It is not clear whether this kind of branching degradation does in fact occur in programs, and if so, how common this is. Perhaps what is more important is that we must somehow specify, in terms of
the  parameters  of  a general  model  of  computation,  the  conditions  under  which 
certain  speedup  bounds  are  the  maximum  (or  minimum,  or  average)  attainable.  Then, 
by  empirical  observations  of  program  behavior,  we  can  see  if  such  conditions  are 
indeed  met.  This  is  the  approach  that  we  take,  in  this  paper,  to  the  problem  of 
finding  speedup  bounds  for  parallel  processors. 

If we do find that, "in all but a finite number of exceptions", the speedup is proportional to $\log_2 p$, then the parallel processor organization is obviously not a very effective speedup mechanism. For example, to achieve an order of magnitude speed improvement, 3 orders of magnitude more processors have to be used, with an efficiency of only 1% of the uniprocessor. Fortunately, the rather extensive empirical experiments of Kuck, et al. [3,4,5] show that the attainable speedup is almost always better than $\log_2 p$.
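The arithmetic behind this example can be checked with a short sketch. It assumes, purely for illustration, that the speedup equals $\log_2 p$ exactly; the function name `log2_speedup_efficiency` is hypothetical and not part of the original paper.

```python
import math

def log2_speedup_efficiency(p):
    """Speedup and efficiency if S_p = log2(p) held exactly (an assumption)."""
    speedup = math.log2(p)
    efficiency = speedup / p      # E_p = S_p / p
    return speedup, efficiency

# About three orders of magnitude of processors (2**10 = 1024).
s, e = log2_speedup_efficiency(1024)
print(s, e)   # speedup of one order of magnitude, efficiency just under 1%
```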

Kuck, et al. [3,4,5] have written a fairly sophisticated program analyser which accepts, as input, an ordinary program written for execution on a uniprocessor, and turns it into a program suitable for execution on a system with p* parallel processors, where p* is the maximum number of processors which can be simultaneously used in the converted program. The point, of course, is to minimize the execution time of the converted program, since the maximum speedup is inversely proportional to the minimum execution time. Based on experiments using this analyser, Kuck has proposed the following observations [3]:
"For many ordinary Fortran programs (with $T_1 \le 10{,}000$), we can find p such that

(1) $T_p = \alpha \log_2 T_1$, for $2 \le \alpha \le 10$

and (2) $p \le \dfrac{T_1}{0.6 \log_2 T_1}$,

such that

(3) $S_p \ge \dfrac{T_1}{10 \log_2 T_1}$ and $E_p \ge 0.3$."

Here $T_1$ is the time taken by the uniprocessor, $T_p$ is the time taken by p parallel processors, $S_p$ is the speedup and $E_p$ is the efficiency.

We notice three points. First, he has chosen to express speedup in terms of $T_1$, the time taken by a uniprocessor, rather than p, the number of processors used. This is because he almost always uses a system with p* parallel processors, where p* is the maximum number of processors that can be simultaneously used, according to the result of his program analyser.

The  second  point  of  difference  is  that  he  considers  the  lower  bound  for 
speedup,  whereas  in  this  paper,  we  are  interested  in  the  upper  bound. 

The  third  point  is  that  he  has  not  given  any  theoretical  proof  of  his 
empirically-derived  formula. 

However, his experimental results form the main source of raw data with which one may compare any speedup bounds obtained by non-empirical methods, like the mathematical derivations used in this paper. Also, we have used essentially the same definitions (see next section) of the performance measures $T_p$, $S_p$ and $E_p$. It was indeed gratifying to discover that the upper bound for speedup, $S_p \le \frac{p}{\ln p}$, that we found by mathematical observations agreed very well with Kuck's experimental results.


Footnote: ln is the natural logarithm (base e) function, differing from the logarithm to any other base by only a constant factor, since $\log_a x = (\log_a b) \cdot \log_b x$.


2.  PROBLEM  DEFINITION 


The  problem  is  to  find  the  bounds  on  the  performance  improvement  of 
a p-parallel  processor  over  a uniprocessor,  for  a given  computation. 

A Model of Computation on a P-Parallel Processor

A computation is a sequence of steps. At each step s, a finite number, k, of instructions may be simultaneously executed. Step s is then said to contain k parallel processes. The relevant parameters for any given computation are:

(i) $p^*$ : the maximum number of parallel processes contained in any step of the computation

(ii) $\{r_i,\ 1 \le i \le p^*\}$ : the probability that i parallel processes are contained in a single step, with $\sum_{i=1}^{p^*} r_i = 1$

(iii) $T_1$ : the time taken to execute the whole computation by a uniprocessor (equivalent to the total number of instructions executed).

A p-parallel  processor  is  a computer  organization  with  p identical 
processors,  each  of  which  is  capable  of  executing  one  instruction  (not 
necessarily  the  same  type  of  instruction)  per  time-unit.  A time-unit  is 
defined to be the amount of time taken by a uniprocessor (or 1-parallel
processor)  to  execute  one  instruction,  and  each  instruction  is  assumed  to 
take  the  same  amount  of  time  for  execution.  Any  number  of  processors,  from  1 
to  p,  may  execute  simultaneously  in  a time-unit.  The  relevant  parameters  for 
any  given  p-parallel  processor  are: 

(i) p : the maximum number of parallel processors in the computer organization

(ii) $\{q_i,\ 1 \le i \le p\}$ : the probability that i parallel processors are simultaneously used in a time-unit, with $\sum_{i=1}^{p} q_i = 1$

The execution of a computation on a p-parallel processor consists of a mapping from steps in the computation to time-units in the computer organization, and a corresponding mapping of the probabilities $\{r_i,\ 1 \le i \le p^*\}$ inherent in the computation to $\{q_i,\ 1 \le i \le p\}$ for the p-parallel processor. If $p \ge p^*$, the mapping is clearly:

$$q_i = \begin{cases} r_i, & 1 \le i \le p^* \\ 0, & p^*+1 \le i \le p \end{cases}$$

Hence, for $p \ge p^*$, the number of time-units taken for execution is equal to the number of steps originally present in the computation, and the computation is said to be executing at maximum speed.

If $p < p^*$, the mapping is not unique, and we will assume that the "optimal mapping" has been used, i.e., the resulting values of $q_i$ correspond to the minimum execution time on the given p-parallel processor.


Definition of Performance Measures

(1)  Tp  is  the  number  of  time-units  taken  by  a p-parallel  processor  to 
execute  a given  computation. 

(2) $S_p$ is the speedup of the p-parallel processor over the uniprocessor, for the same computation:

$$S_p = \frac{T_1}{T_p} = \frac{\text{execution time on uniprocessor}}{\text{execution time on p-parallel processor}}$$

(3) $E_p$ is the efficiency of the p-parallel processor:

$$E_p = \frac{S_p}{p}$$

($E_p$ compares the actual execution bandwidth, $S_p$, to the maximum possible bandwidth, p.)

(4) $ST_p$ is the space-time product for executing a given computation on a p-parallel processor:

$$ST_p = p\,T_p$$

(The relative space-time product, $\frac{ST_p}{ST_1}$, is inversely proportional to the efficiency, $E_p$, and can be said to measure the "cost" of using the p-parallel processor compared with a uniprocessor.)
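As a minimal sketch of how the four measures relate, the following computes them directly from the definitions in this section. The function name and the sample numbers are hypothetical, chosen only for illustration.

```python
def performance_measures(t1, tp, p):
    """Return (S_p, E_p, ST_p, ST_p / ST_1) from the definitions above."""
    speedup = t1 / tp            # S_p = T_1 / T_p
    efficiency = speedup / p     # E_p = S_p / p
    space_time = p * tp          # ST_p = p * T_p
    relative = space_time / t1   # ST_1 = 1 * T_1, so the ratio is p * T_p / T_1
    return speedup, efficiency, space_time, relative

# Hypothetical run: 1000 instructions finished in 125 time-units on 16 processors.
s, e, st, rel = performance_measures(1000, 125, 16)
print(s, e, st, rel)
```

In this example the relative space-time product is the reciprocal of the efficiency, which is the sense in which it measures the "cost" of the parallel organization.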


Immediate  Results 


Fact 2.1: The execution time taken by a p-parallel processor, for a given computation, is

$$T_p \ge T_1 \sum_{i=1}^{p} \frac{q_i}{i}$$

Fact 2.2: The speedup of a p-parallel processor, for a given computation, is

$$S_p \le \frac{1}{\sum_{i=1}^{p} \frac{q_i}{i}}$$

Fact 2.3: For any given computation, the absolute lower bound on the execution time is

$$T_p \ge T_1 \sum_{i=1}^{p^*} \frac{r_i}{i}$$

Fact 2.4: For any given computation, the absolute upper bound on the speedup attainable is

$$S_p \le \frac{1}{\sum_{i=1}^{p^*} \frac{r_i}{i}}$$

Fact 2.5: $1 \le S_p \le \min(p, p^*)$

The  above  results  are  stated  without  proof  since  they  are  derived  directly 
from  the  definitions. 

We note that Facts 2.1 and 2.2 are architecture-dependent bounds, whereas Facts 2.3 and 2.4 are computation-dependent bounds. The inequalities in Facts 2.1 to 2.4 may be replaced by equalities, if we do not insist that $T_p$ be expressed as an integral number of time-units. Also, in Facts 2.1 and 2.2, we can implicitly take care of the probability that all processors are idle if we interpret $q_1$ as the probability that 0 or 1 processors are used in a time-unit.

Fact  2.5  illustrates  that,  in  general,  the  performance  is  limited  by  both 
the  architecture  and  the  computation  being  executed. 
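The bounds of Facts 2.1 and 2.2, $T_p \ge T_1 \sum q_i/i$ and $S_p \le 1/\sum q_i/i$, can be evaluated for any distribution $\{q_i\}$. The sketch below uses a hypothetical distribution for p = 4; the function names are illustrative assumptions, not part of the paper.

```python
def speedup_upper_bound(q):
    """Fact 2.2 bound: 1 / sum(q_i / i), where q[0] holds q_1."""
    return 1.0 / sum(qi / i for i, qi in enumerate(q, start=1))

def time_lower_bound(t1, q):
    """Fact 2.1 bound: T_1 * sum(q_i / i)."""
    return t1 * sum(qi / i for i, qi in enumerate(q, start=1))

q = [0.1, 0.2, 0.3, 0.4]            # hypothetical distribution for p = 4
bound = speedup_upper_bound(q)
print(bound, time_lower_bound(1000, q))
print(1 <= bound <= len(q))         # consistent with Fact 2.5 when p <= p*
```

The two extreme cases recover Fact 2.5: all weight on $q_1$ gives a bound of 1, and all weight on $q_p$ gives a bound of p.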




3.  DERIVATION  OF  TIGHTER  UPPER  BOUND  FOR  SPEEDUP 


Typical Speedup Ratios


It  seems  clear  that  the  nature  of  the  computation  (or  "program  behavior") 
will  limit  the  actual  speedup  obtainable  by  using  a p-parallel  processor. 

It is therefore instructive to review some of the best speedup ratios obtained in the past, for various specialized types of computations. Such typical speedup ratios are on the order of [9-13,3]:

(i) $k_1 p$ : matrix computations

(ii) $k_2 \frac{p}{\log p}$ : sorting; tridiagonal linear systems; linear recurrence relations; polynomial evaluation; evaluation of arithmetic expressions without division

(iii) $k_3 \log p$ : searching ordered list (actually $\log_2(p+1)$)

(iv) $k_4$ (independent of p) : some nonlinear recurrence relations; some compiler routines


In each case, the $k_i$ are some constants, and p often refers to the maximum number of logical parallel processes within the computation, rather than to the maximum number of physical parallel processors actually present in the computer system. Hopefully, we have eliminated this kind of confusion in our model, by using p* for the maximum number of processes, and p for the maximum number of processors.

We  note  the  following  points: 

(1) The types of computations which have a linear speedup, proportional to $k_1 p$ for $k_1 \le 1$, are rather rare. They tend to have large amounts of inherent iterative structure, acting on disjoint domains.

(2) Since it is clear from Fact 2.5 that in fact $S_p \le p$ for all computations, the upper bound of $k_1 p$ does not give us any more information.

(3) That the speedup of computations can be independent of p is also clear from Fact 2.5 ($S_p \le p^*$) and Fact 2.2, where

$$S_p \le \frac{1}{\sum_{i=1}^{p} \frac{q_i}{i}} \le \frac{1}{q_1}, \quad \text{since each term is non-negative.}$$
It is interesting to note that the speedup is limited by the reciprocal of the proportion of time that at most 1 processor is used, i.e., by the inherent serial nature of the computation.

This is essentially similar to Amdahl's argument in favor of uniprocessors, and also agrees well with empirical data from various sources. For example, Flynn [2] cites statistics that 16.5% of the instructions executed in "General Technical" type computations are conditional (nonresolvable) branch instructions.

Also, studies on the Atlas machine in England [22] indicate that conditional branch instructions form about 10% of all instructions executed. Since it is often hypothesized that conditional branches introduce inherent serialism into computations, we could say that empirical data suggest that $q_1 \ge 0.1$, agreeing with Amdahl's "private statistics". Hence, this implies that $S_p \le 10$. In fact, Kuck's [3] experimental results give a value of $S_p$, averaged over a large class of different types of computations, of 9.8.

Fortunately, however, the variance is large, and many individual types of computations have speedups greater than this.

This processor-independent upper bound of $S_p \le \frac{1}{q_1}$ is even more sobering when we consider that $q_1$ includes not only the probability that 1 processor is used, but also the probability that all processors are idle. Hence, if the system's resources are not well-balanced, e.g., delays due to memory faults dominate the execution time, then the probability that all processors will be idle will be very high, jacking up the value of $q_1$, regardless of the number, p, of physical parallel processors. Hence $S_p \le O\left(\frac{1}{q_1}\right)$, for all p, a most gloomy proposition. Intuitively, this is clear:

Observation 3.1:

(i) $S_p \le \frac{1}{q_1}$, for all p. If $q_1$ is large, then $S_p \approx 1$, since $0 \le q_1 \le 1$, and increasing the number of processors will not speed up the computation.

(ii) $q_1$ will be large if the computation is not processor bound and/or the computation is highly serial in nature.

(4) On the more optimistic side, we note that many types of computations can achieve higher speedups. Of the two remaining typical speedup ratios, $O\left(\frac{p}{\log p}\right)$ is achieved by more types of computations than the lower value of $O(\log p)$. This suggests that we examine the bound $\frac{p}{\log p}$ more closely.
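Point (3) above, that $S_p \le 1/q_1$ however many processors are added, can be illustrated numerically. The sketch assumes the most favorable case for a fixed $q_1$: all remaining probability sits on i = p. This distribution and the function name are hypothetical choices made for illustration.

```python
def bound_with_serial_fraction(q1, p):
    """Fact 2.2 bound when q_1 = q1 and the remaining weight (1 - q1) is on i = p."""
    return 1.0 / (q1 + (1.0 - q1) / p)

# With q_1 = 0.1 the bound creeps toward 1/q_1 = 10 but never reaches it.
for p in (10, 100, 1000, 10**6):
    print(p, round(bound_with_serial_fraction(0.1, p), 3))
```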


Intuitive Motivation of the Expression $S_p \le \frac{p}{\ln p}$

It  is  clear  from  Fact  2.5  that  all  computations  executing  on  all  p-parallel 
processors,  have  a maximum  speedup  of  0(p). 

It is our intent to try to establish the next tighter bound $O\left(\frac{p}{\ln p}\right)$, and show the conditions under which this is the maximum attainable speedup. We will first try to motivate the discussion by giving the following graphical or physical interpretation to the crucial expression, $\sum_{i=1}^{p} \frac{q_i}{i}$, in both the $T_p$ and $S_p$ bounds (see Facts 2.1 to 2.4).

Let $f(i) = \frac{1}{i}$, for i=1,2,...,p. Then since the $q_i$ are probabilities which sum to 1, the expression $\sum_{i=1}^{p} \frac{q_i}{i}$ is clearly the weighted average, or mean, of the function.

The graph of f(i) versus i is plotted in Figure 1, for p=6.

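If we give each value of f(i) equal weight, then $q_1 = q_2 = \ldots = q_p = \frac{1}{p}$ and

$$\sum_{i=1}^{p} \frac{q_i}{i} = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{i} = \frac{H_p}{p},$$

where $H_p$ is the pth harmonic number, defined as follows:

Definition 3.1: $H_p = \sum_{i=1}^{p} \frac{1}{i}$, called the pth harmonic number.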

The expression $H_p$ will be fundamental in establishing our speedup bound. It is "well-known" in mathematical circles (see e.g., [21]) that

Fact 3.1: $H_p = \ln p + \gamma + \dfrac{1}{2p} - \dfrac{1}{12p^2} + \dfrac{1}{120p^4} - \epsilon$

where $\gamma = 0.57721\ldots$, called "Euler's constant", and $0 < \epsilon < \dfrac{1}{252p^6}$.

Hence, $H_p = \ln p + \gamma + O\left(\frac{1}{p}\right) > \ln p$.
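Fact 3.1 can be checked numerically. The sketch below compares the exact partial sum with the expansion truncated before the error term; the constant name `EULER_GAMMA` and both function names are illustrative assumptions.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, to double precision

def harmonic(p):
    """Exact partial sum H_p = 1 + 1/2 + ... + 1/p."""
    return sum(1.0 / i for i in range(1, p + 1))

def harmonic_series_approx(p):
    """Fact 3.1 without the final error term."""
    return (math.log(p) + EULER_GAMMA + 1 / (2 * p)
            - 1 / (12 * p**2) + 1 / (120 * p**4))

for p in (2, 6, 36):
    print(p, harmonic(p), harmonic_series_approx(p), math.log(p))
```

The truncated expansion slightly overestimates $H_p$, by an amount smaller than $1/(252p^6)$, and $H_p$ itself stays above $\ln p$.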
Next, we note that f(i) is a strictly decreasing function, so that if we assign the "weights" $q_i$ such that the smaller values of i get larger values of $q_i$, then the "balance-point" of the curve f(i) will shift to the left, and the mean of f(i), viz., $\sum_{i=1}^{p} \frac{q_i}{i}$, will be greater than the equally-weighted case, viz., $\frac{1}{p}\sum_{i=1}^{p} \frac{1}{i} = \frac{H_p}{p}$. (The reverse situation causes $\sum_{i=1}^{p} \frac{q_i}{i}$ to be less than $\frac{H_p}{p}$.) See Figure 2.

In terms of the p-parallel processor, "assigning larger values of $q_i$ to smaller values of i" simply means that "the probability that a smaller number of processors is used in a time-unit is greater than the probability that a larger number of processors is used". This seems to be a very general condition that is usually satisfied in practice. We will formalise these ideas in the next section.

Theoretical Derivation

In Lemma 3.1, we consider the "Equally-Likely Hypothesis" to derive $S_p \le \frac{p}{\ln p}$. The fundamental theorem (Theorem 3.1) then specifies precisely the necessary and sufficient conditions on the computation for the upper bound $S_p \le \frac{p}{\ln p}$ to hold.

In corollaries 3.1 and 3.2, we show some less general, but sufficient conditions for the upper bound to hold. These sufficient conditions are easier to check for in a given computation, and lend themselves more easily to physical interpretation.

In corollary 3.3, we show that the upper bound for $S_p$ need not necessarily decrease when we use fewer physical processors, p, than the maximum number of logical processes, p*, present in the computation. This is due to the probability distribution of the actual number of processors used, as indicated by the necessary and sufficient conditions given in Theorem 3.1.

Finally,  in  corollaries  3.4  to  3.6  we  summarize  our  results,  and  plot  them  in 
Figures  5 to  7. 

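Lemma 3.1 If $q_1 = q_2 = \ldots = q_p$ and $\sum_{i=1}^{p} q_i = 1$, then $S_p \le \frac{p}{H_p} < \frac{p}{\ln p}$.

Proof:

$$S_p \le \frac{1}{\sum_{i=1}^{p} \frac{q_i}{i}} \quad \text{by Fact 2.2}$$

$$= \frac{1}{\frac{1}{p} \sum_{i=1}^{p} \frac{1}{i}} \quad \text{since } q_1 = q_2 = \ldots = q_p \text{ and } \sum_{i=1}^{p} q_i = 1$$

$$= \frac{p}{H_p} \quad \text{by definition of } H_p$$

$$< \frac{p}{\ln p} \quad \text{since } H_p > \ln p, \text{ by Fact 3.1}$$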

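Theorem 3.1 (Necessary and Sufficient Conditions for $S_p \le \frac{p}{\ln p}$)

$$S_p \le \frac{p}{H_p} < \frac{p}{\ln p} \quad \text{iff} \quad \sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} \ge 0, \quad \text{where } \sum_{i=1}^{p} q_i = 1.$$

Proof:

$$\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} \ge 0 \iff \sum_{i=1}^{p} \frac{q_i}{i} \ge \frac{1}{p} \sum_{i=1}^{p} \frac{1}{i} = \frac{H_p}{p} \iff \frac{1}{\sum_{i=1}^{p} \frac{q_i}{i}} \le \frac{p}{H_p} \iff S_p \le \frac{p}{H_p},$$

where the last step uses Fact 2.2 taken with equality.

That $\frac{p}{H_p} < \frac{p}{\ln p}$ follows from Fact 3.1. □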
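The condition of Theorem 3.1, $\sum_{i=1}^{p} (q_i - \frac{1}{p})/i \ge 0$, is easy to evaluate for a concrete distribution. The following sketch checks it against the bound $1/\sum q_i/i \le p/H_p$ for two hypothetical distributions; all function names are illustrative assumptions.

```python
def harmonic(p):
    """H_p = 1 + 1/2 + ... + 1/p."""
    return sum(1.0 / i for i in range(1, p + 1))

def theorem_condition(q):
    """Left-hand side of the condition: sum((q_i - 1/p) / i)."""
    p = len(q)
    return sum((qi - 1.0 / p) / i for i, qi in enumerate(q, start=1))

def fact_2_2_bound(q):
    """Fact 2.2 speedup bound: 1 / sum(q_i / i)."""
    return 1.0 / sum(qi / i for i, qi in enumerate(q, start=1))

# One distribution favoring few processors, one favoring many (both hypothetical).
for q in ([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]):
    p = len(q)
    print(q, theorem_condition(q) >= 0, fact_2_2_bound(q) <= p / harmonic(p))
```

For the first distribution both tests hold, for the second both fail, as the equivalence in the proof requires.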


Hence, we have shown that under certain general circumstances, the maximum speedup for a p-parallel processor is less than $\frac{p}{\ln p}$, and we have found necessary and sufficient conditions for this to be true.

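The condition $\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} \ge 0$ may be interpreted as follows:

Let $\epsilon_i = q_i - \frac{1}{p}$.

Let $A = \sum_{\epsilon_i > 0} \frac{\epsilon_i}{i}$ and $B = \sum_{\epsilon_i < 0} \frac{|\epsilon_i|}{i}$.

Then $A \ge B$.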
In words, this means that for all those i where the probability of using i processors is greater than $\frac{1}{p}$, the sum of these differences ($\epsilon_i$), weighted by $\frac{1}{i}$, is the term A. B is the sum of the differences ($\epsilon_i$), weighted by $\frac{1}{i}$, where the probability of using i processors is less than $\frac{1}{p}$. The condition requires the positive sum, A, to be greater than the negative sum, B.

For example, since the function $f(i) = \frac{1}{i}$, i=1,2,...,p, is a strictly decreasing function, then if the $q_i$ are also non-increasing, i.e., $q_1 \ge q_2 \ge \ldots \ge q_p$, then clearly the above condition is satisfied, and $S_p \le \frac{p}{\ln p}$.

This  is  stated  in  the  following  corollary: 


Corollary 3.1 If $q_1 \ge q_2 \ge \ldots \ge q_p$ and $\sum_{i=1}^{p} q_i = 1$, then $S_p \le \frac{p}{\ln p}$.


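Proof: If $q_1 \ge q_2 \ge \ldots \ge q_p$ then $\exists\, k$ such that

$$q_i \ge \frac{1}{p} \text{ for } 1 \le i \le k, \qquad q_i < \frac{1}{p} \text{ for } k+1 \le i \le p.$$

Also, since $\sum_{i=1}^{p} q_i = 1$, we have $\sum_{i=1}^{p} \left(q_i - \frac{1}{p}\right) = 0$, so that

$$\sum_{i=1}^{k} \left(q_i - \frac{1}{p}\right) = \sum_{i=k+1}^{p} \left(\frac{1}{p} - q_i\right).$$

Then

$$\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} = \sum_{i=1}^{k} \frac{q_i - \frac{1}{p}}{i} - \sum_{i=k+1}^{p} \frac{\frac{1}{p} - q_i}{i}$$

$$\ge \frac{1}{k} \sum_{i=1}^{k} \left(q_i - \frac{1}{p}\right) - \frac{1}{k+1} \sum_{i=k+1}^{p} \left(\frac{1}{p} - q_i\right)$$

$$= \left(\sum_{i=1}^{k} \left(q_i - \frac{1}{p}\right)\right) \left(\frac{1}{k} - \frac{1}{k+1}\right) = \left(\sum_{i=1}^{k} \left(q_i - \frac{1}{p}\right)\right) \frac{1}{k(k+1)}$$

$$\ge 0, \text{ since each term is non-negative.}$$

The result then follows from Theorem 3.1. □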

We note that Lemma 3.1, or the equally-likely hypothesis, is just the special case of Corollary 3.1 when $q_1 = q_2 = \ldots = q_p$.

From the proof of Corollary 3.1, it is clear that we do not need the $q_i$ to be decreasing with increasing i. All that is required is that there is a k for which the values of $q_i$, for $i \le k$, are greater than or equal to $\frac{1}{p}$, and the values of $q_i$, for $i > k$, are less than $\frac{1}{p}$. This leads to the following corollary, which gives a sufficient condition for $S_p \le \frac{p}{\ln p}$ that is more general than that of $q_1 \ge q_2 \ge \ldots \ge q_p$.


Corollary 3.2 If $\exists\, k$ such that $q_i \ge \frac{1}{p}$ for $1 \le i \le k$, and $q_i < \frac{1}{p}$ for $k+1 \le i \le p$, and $\sum_{i=1}^{p} q_i = 1$, then $S_p \le \frac{p}{\ln p}$.

Proof: same as for corollary 3.1.
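The sufficient condition of Corollary 3.2 amounts to a simple threshold test on the $q_i$. A minimal sketch follows, with hypothetical function names and sample distributions; it returns the index k when one exists and `None` otherwise.

```python
def threshold_k(q):
    """Return k such that q_i >= 1/p for i <= k and q_i < 1/p for i > k, else None."""
    p = len(q)
    above = [qi >= 1.0 / p for qi in q]
    k = sum(above)
    # Corollary 3.2 needs every "above" entry to precede every "below" entry.
    if above == [True] * k + [False] * (p - k):
        return k
    return None

def theorem_condition(q):
    """Theorem 3.1 condition value: sum((q_i - 1/p) / i)."""
    p = len(q)
    return sum((qi - 1.0 / p) / i for i, qi in enumerate(q, start=1))

q = [0.3, 0.4, 0.25, 0.03, 0.02]    # not monotone, but a threshold k exists
print(threshold_k(q), theorem_condition(q) >= 0)
print(threshold_k([0.4, 0.1, 0.4, 0.1]))   # no such k for this distribution
```

Note that the first distribution is not non-increasing, so Corollary 3.1 does not apply to it, yet the threshold test still succeeds.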


Actually, the tightest upper bound that we have derived in Theorem 3.1 is that $S_p \le \frac{p}{H_p}$ iff $\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} \ge 0$. Also we know from Fact 2.4 that the maximum speedup attainable for a given computation (assuming unlimited processors, i.e., $p \ge p^*$) is $S_p \le \frac{p^*}{H_{p^*}}$, under the corresponding condition on the $r_i$. For $p < p^*$, we intuitively expect a decrease in the speedup. However, we can show that, in fact, the upper bound of the speedup for $p < p^*$ need not necessarily be decreased.

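Corollary 3.3 $\exists\, p < p^*$ such that $S_p > \frac{p}{H_p}$ when $S_p \le \frac{p^*}{H_{p^*}}$ for $p \ge p^*$.

Proof: Let $q_i$, $1 \le i \le p^*$, be such that $S_p \le \frac{p^*}{H_{p^*}} < \frac{p^*}{\ln p^*}$ for $p \ge p^*$.

Let $a_i$, $1 \le i \le p$, be the probability that i processors will be used when $p < p^*$.

Let $a_i = q_i = \frac{1}{p^*}$, for $1 \le i \le p-1$, and $a_p = \sum_{i=p}^{p^*} q_i = \frac{p^* - p + 1}{p^*}$.

Now,

$$\sum_{i=1}^{p} \frac{a_i}{i} = \sum_{i=1}^{p-1} \frac{a_i}{i} + \frac{a_p}{p} = \frac{1}{p^*} H_{p-1} + \frac{p^* - p + 1}{p\,p^*}$$

$$< \frac{1}{p}\left(1 + \frac{1}{2} + \ldots + \frac{1}{p-1}\right) + \frac{1}{p} \cdot \frac{1}{p} = \frac{H_p}{p},$$

the difference between the two sides being $(H_p - 1)\left(\frac{1}{p} - \frac{1}{p^*}\right) > 0$ for $2 \le p < p^*$.

$$\therefore\ S_p > \frac{p}{H_p} \quad \text{(taking Fact 2.2 with equality).}$$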
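A numerical instance of the construction used in this proof may help. The sketch assumes the equally-likely distribution $q_i = 1/p^*$ and folds the weight of $i = p, \ldots, p^*$ onto $a_p$; the values p = 16 and p* = 36 and all function names are illustrative assumptions.

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def folded_distribution(p, p_star):
    """a_i = 1/p* for i < p; a_p collects the weight of i = p..p*."""
    return [1.0 / p_star] * (p - 1) + [(p_star - p + 1) / p_star]

def fact_2_2_bound(q):
    """Fact 2.2 speedup bound: 1 / sum(q_i / i)."""
    return 1.0 / sum(qi / i for i, qi in enumerate(q, start=1))

p, p_star = 16, 36                      # illustrative values only
a = folded_distribution(p, p_star)
print(fact_2_2_bound(a), p / harmonic(p))
```

For these values the Fact 2.2 bound computed from the folded distribution is larger than $p/H_p$.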
It should be noted that we have used the expression $\frac{p}{\ln p}$ as an approximation to the (tighter) upper bound of $\frac{p}{H_p}$, because $\ln p$ is a more well-tabulated and easily-recognized function than $H_p$. In fact, $H_p \sim \ln p$ (i.e., $H_p / \ln p \to 1$) as $p \to \infty$. But for small values of p, especially $p \le 8$, we should really use the actual upper bound of $\frac{p}{H_p}$, since $\frac{p}{\ln p}$ shows some anomalous behavior which is not exhibited by $\frac{p}{H_p}$, e.g., $\frac{p}{\ln p}$ has a minimum point at p=3, whereas $\frac{p}{H_p}$ is a strictly increasing function. Also, for p=1, $\frac{p}{\ln p}$ is undefined whereas $\frac{p}{H_p} = 1$.
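The small-p behavior of the two expressions can be tabulated directly. This is a minimal sketch; the variable names are illustrative assumptions.

```python
import math

def harmonic(p):
    """H_p = 1 + 1/2 + ... + 1/p."""
    return sum(1.0 / i for i in range(1, p + 1))

# p / ln p is only defined for p >= 2; p / H_p is defined from p = 1.
ln_curve = {p: p / math.log(p) for p in range(2, 9)}
h_curve = {p: p / harmonic(p) for p in range(1, 9)}

for p in range(2, 9):
    print(p, round(ln_curve[p], 3), round(h_curve[p], 3))

print(min(ln_curve, key=ln_curve.get))   # integer p at which p / ln p is smallest
```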

The results of this chapter are summarized in the following figures and corollaries:


Figure 3 shows the speedup, $S_p$, versus p, for one fixed computation (i.e., p* is fixed at 36, p is variable). We note that a maximum speedup of an order of magnitude ($S_p \le 10$) is attainable when p*=36, in cases (i) and (ii). In any case, the speedup cannot be greater than p*. Hence, no improvement in speed can result from using more physical processors, p, than the maximum number of logical parallel processes, p*, in the computation.

Figure 4 shows the speedup, $S_p$, versus p*, for one fixed p-parallel processor (i.e., p is fixed at 16 processors, p* is variable). This time, the speedup for different computations depends on whether the computation has more inherent parallelism, p*, than the computer system, with a fixed number p of parallel processors, or not.


Figure 5 considers the maximum attainable speedup, $S_{p^*}$, of each computation, over all possible computations. Hence, in each case, we choose a computer organization with p=p* processors. We see that $S_{p^*}$ is limited by either $\frac{p^*}{\ln p^*}$ or p*.


We summarize the attainable speedup regions for one computation:

Corollary 3.4: For any given computation, one of the following is true:

(i) $S_p \le \min\left(\frac{p}{\ln p}, \frac{p^*}{\ln p^*}\right)$ iff $\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} \ge 0$

(ii) $S_p \le \min\left(p, \frac{p^*}{\ln p^*}\right)$ if $\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} < 0$ but $\sum_{i=1}^{p^*} \frac{r_i - \frac{1}{p^*}}{i} \ge 0$

(iii) $S_p \le \min(p, p^*)$ otherwise
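The three regions of Corollary 3.4 can be expressed as a small case analysis. The sketch below assumes $p \ge 2$ and $p^* \ge 2$ (so that the logarithms are nonzero) and takes the two distributions as plain lists; the function names and sample distributions are hypothetical.

```python
import math

def condition(dist):
    """sum((d_i - 1/n) / i) for a distribution of length n."""
    n = len(dist)
    return sum((d - 1.0 / n) / i for i, d in enumerate(dist, start=1))

def corollary_3_4_bound(q, r):
    """Return (case, bound) for processor usage q (length p) and
    process distribution r (length p*); assumes p >= 2 and p* >= 2."""
    p, p_star = len(q), len(r)
    if condition(q) >= 0:
        return "i", min(p / math.log(p), p_star / math.log(p_star))
    if condition(r) >= 0:
        return "ii", min(p, p_star / math.log(p_star))
    return "iii", float(min(p, p_star))

print(corollary_3_4_bound([0.4, 0.3, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1]))
print(corollary_3_4_bound([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]))
```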


We  have  established  the  fact  that  the  speedup  depends  on  both  the 
architecture  and  the  computation.  However,  we  often  wish  to  express  speedup 
bounds  solely  in  terms  of  the  architecture  in  order  that  we  may  evaluate 


the  performance  of  a particular  architectural  feature,  e.g.,  the  number  of 
parallel  processors  available,  p.  The  next  corollary  gives  computation- 
independent  speedup  bounds: 


Corollary 3.5

(i) $S_p \le \frac{p}{\ln p}$ iff $\sum_{i=1}^{p} \frac{q_i - \frac{1}{p}}{i} \ge 0$

(ii) $S_p \le p$, otherwise.

Similarly,  we  can  express  the  maximum  speedup  of  any  given  computation, 
in  the  following  purely  computation-dependent  speedup  bounds: 


Corollary 3.6

The maximum speedup of a given computation, assuming sufficient physical parallel processors ($p \ge p^*$), is:

(i) $S_p \le \frac{p^*}{\ln p^*}$, iff $\sum_{i=1}^{p^*} \frac{r_i - \frac{1}{p^*}}{i} \ge 0$

(ii) $S_p \le p^*$, otherwise.
Comparison with Empirical Results

Kuck,  et  al  [3-5]  have  written  a sophisticated  program  analyser  which 
turns  ordinary  Fortran  programs  written  for  uniprocessors  into  the  format  of 
a "computation"  as  defined  in  our  model  of  computation  on  a p-parallel  processor. 
They  have  then  carried  out  experiments  on  a wide  variety  of  programs  to  observe 
the  speedup  attainable  by  ordinary  programs  when  a finite  but  unlimited  number 
of  parallel  processors  are  available  on  which  to  execute  these  programs.  In 
terms  of  parameters  in  our  model,  they  have  chosen  p=p*  processors,  where  p* 
is  determined  for  each  program  by  their  program  analyser. 

The eighty-six programs they have analysed are summarized into seven categories, as can be seen in Table 1. The average over these 7 categories is the entry labelled "ALL". As this work comprises perhaps the major experimental effort in the area of determining speedup ratios of programs on parallel processor systems, we cannot ignore these results. Hence we use these experimental results as raw data by which we compare our theoretically derived upper bound for speedup of $O\left(\frac{p^*}{\ln p^*}\right)$. Kuck's results are plotted in Figure 5, the graph of $S_{p^*}$ versus p*, and tabulated in Table 1. As can be seen, there is good agreement between Kuck's experimental results and corollary 3.6. Only 2 of the 7 categories exceed the bound, viz., NUME and EIS. On examining these two types of computations in greater detail, we find that the two points they share in common are:

(1)  they  have  the  two  highest  average  number  of  nested  DOs 

(2)  they  have  the  two  highest  average  number  of  Assignment-Statement 
Blocks  inside  DO  loops. 

NUME contains standard numerical analysis programs and EIS contains eigenvalue
programs. Both are matrix-type computations with a large amount of inherent
parallelism: they have a large number of iterative loops acting on disjoint
elements in matrices. In the earlier section on "Typical Speedup Ratios", we have
already mentioned that such types of computations have maximum speedup ratios of kp,
that is, linear in the number of processors available.

Since NUME and EIS account for 18 of the 86 programs studied, we can say
that about 80% of the programs (68 of 86) have speedups less than $p^*/\ln p^*$.
4. RELATED PERFORMANCE MEASURES


Based on the definitions in Chapter 2, we can obtain related performance
bounds corresponding to corollaries 3.4, 3.5 and 3.6 for each of execution
time $T_p$, efficiency $E_p$, and space-time product $ST_p$, for a p-parallel processor.

We  will  use  the  following  notation: 


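Condition A: $\sum_{i=1}^{p} \frac{1}{i}\left(q_i - \frac{1}{p}\right) \ge 0$

Condition B: $\sum_{i=1}^{p} \frac{1}{i}\left(q_i - \frac{1}{p}\right) < 0$ but $\sum_{i=1}^{p^*} \frac{1}{i}\left(r_i - \frac{1}{p^*}\right) \ge 0$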


We  will  state  only  those  results  corresponding  to  corollary  3.4,  where 
the  dependence  of  the  performance  bound  on  both  the  architecture  and  the 
computation  is  shown. 

Minimum  Execution  Time 

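Theorem 4.1

For any given computation, one of the following is true:

(i) $T_p \ge T_1 \cdot \max\left(\frac{\ln p}{p}, \frac{\ln p^*}{p^*}\right)$ iff condition A

(ii) $T_p \ge T_1 \cdot \max\left(\frac{1}{p}, \frac{\ln p^*}{p^*}\right)$ if condition B

(iii) $T_p \ge T_1 \cdot \max\left(\frac{1}{p}, \frac{1}{p^*}\right)$ otherwise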

In  Figure  6,  we  plot  Tp  versus  p for  a given  computation  (p*  fixed  at  36). 
In  Figure  7,  we  plot  Tp*  versus  p*,  for  all  possible  computations. 

From both figures, we can see that it is no use increasing the number of
physical processors used beyond p*, since the minimum execution time for the
computation has a lower bound of either $(\ln p^*/p^*)\,T_1$ or $T_1/p^*$, depending only on
the probability distribution $\{r_i, 1 \le i \le p^*\}$ of the computation, independent
of the computer system.
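The plateau at p = p* can be checked numerically. The following Python sketch evaluates the three lower bounds of Theorem 4.1. The function name `min_time_bound` and its `case` argument are illustrative choices, not notation from this paper, and ln p stands in for the harmonic number Hp, so p and p* are assumed to be at least 2.

```python
import math


def min_time_bound(p, p_star, t1=1.0, case="i"):
    """Lower bound on T_p from Theorem 4.1, with ln p approximating H_p.

    p      -- number of physical processors (assumed >= 2)
    p_star -- maximum logical parallelism of the computation (assumed >= 2)
    t1     -- uniprocessor execution time T_1
    case   -- "i" (condition A), "ii" (condition B) or "iii" (otherwise)
    """
    if p < 2 or p_star < 2:
        raise ValueError("the ln-based bounds are only meaningful for p, p* >= 2")
    if case == "i":
        return t1 * max(math.log(p) / p, math.log(p_star) / p_star)
    if case == "ii":
        return t1 * max(1.0 / p, math.log(p_star) / p_star)
    if case == "iii":
        return t1 * max(1.0 / p, 1.0 / p_star)
    raise ValueError("case must be 'i', 'ii' or 'iii'")


# Sweep p for a fixed computation with p* = 36, as in Figure 6.
for p in (4, 16, 36, 64, 128):
    print(p, round(min_time_bound(p, 36), 4))
```

For p* = 36 the case (i) bound decreases as p grows towards 36 and is unchanged for any larger p, which is the behaviour described above.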
Maximum Efficiency


Theorem 4.2

For any given computation, one of the following is true:

(i) $E_p \le \min\left(\frac{1}{\ln p}, \frac{p^*}{p \ln p^*}\right)$ iff condition A

(ii) $E_p \le \min\left(1, \frac{p^*}{p \ln p^*}\right)$ if condition B

(iii) $E_p \le \min\left(1, \frac{p^*}{p}\right)$ otherwise


In Figure 8, we show Ep versus p for one fixed computation (i.e., p* given).

We note that the efficiency drops rapidly as we increase p. For case (i), with
p*=36, we achieve an order of magnitude speed improvement with a maximum efficiency
of 0.28. Under any circumstances, there is no value in increasing p beyond p*,
since the maximum efficiency deteriorates at the rate of 1/p while the maximum
speedup remains constant. (See also Figure 3.)
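The figures quoted for p* = 36 can be reproduced from the case (i) bounds. The Python sketch below is only an illustration: the function names are chosen here, not taken from the paper, and ln p is used in place of Hp.

```python
import math


def max_speedup(p, p_star):
    """Case (i) speedup bound: min(p / ln p, p* / ln p*)."""
    return min(p / math.log(p), p_star / math.log(p_star))


def max_efficiency(p, p_star):
    """Case (i) efficiency bound: min(1 / ln p, p* / (p ln p*))."""
    return min(1.0 / math.log(p), p_star / (p * math.log(p_star)))


# Using exactly p = p* = 36 processors.
print(round(max_speedup(36, 36), 2))     # about 10, an order of magnitude
print(round(max_efficiency(36, 36), 2))  # about 0.28

# Doubling p beyond p* leaves the speedup bound unchanged and halves
# the efficiency bound, i.e. the efficiency falls off as 1/p.
print(round(max_speedup(72, 36), 2), round(max_efficiency(72, 36), 2))
```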


In  Figure  9,  we  show  Ep*  versus  p*  for  all  possible  computations.  This 
is  the  maximum  attainable  efficiency  when  p=p*  processors  are  used,  giving 
the  maximum  speedup,  Sp*.  (See  also  Figure  5) 

Minimum  Space-Time  Product 

Theorem 4.3

For any given computation, one of the following is true:

(i) $ST_p \ge T_1 \cdot \max\left(\ln p, \frac{p}{p^*} \ln p^*\right)$ iff condition A

(ii) $ST_p \ge T_1 \cdot \max\left(1, \frac{p}{p^*} \ln p^*\right)$ if condition B

(iii) $ST_p \ge T_1 \cdot \max\left(1, \frac{p}{p^*}\right)$ otherwise


In Figure 10, we plot STp* versus p*. For case (i), we see that the
performance of the p-parallel processor, measured in terms of the relative
space-time product (normalised by $T_1$), increases at least as ln p*.
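A short Python sketch of the case (i) bound of Theorem 4.3 shows this growth. The function name is hypothetical, T1 defaults to 1 so that the result is the relative space-time product, and ln p again approximates Hp.

```python
import math


def min_space_time_bound(p, p_star, t1=1.0):
    """Case (i) lower bound on ST_p: T_1 * max(ln p, (p / p*) * ln p*)."""
    return t1 * max(math.log(p), (p / p_star) * math.log(p_star))


# With p = p*, the relative bound is exactly ln p*.
for p_star in (4, 16, 36, 64):
    print(p_star, round(min_space_time_bound(p_star, p_star), 3))

# For a fixed computation (p* = 36), adding processors beyond p* makes
# the bound grow linearly in p.
print(round(min_space_time_bound(72, 36), 3))
```

When p = p* the two arguments of the max coincide, so the bound is ln p*. Beyond p* the second argument dominates and the bound rises in proportion to p.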




5.  CONCLUSIONS 


We  have  defined  a general  model  of  computation  on  a p-parallel  processor, 
and isolated performance measures by which we may evaluate the performance
improvements,  if  any,  of  p-parallel  processor  systems  over  uniprocessor  systems. 
We  have  shown  how  the  performance  depends  on  both  the  computer  architecture 
and  the  computation,  and  distinguished  clearly  between  the  number  of 
physical  parallel  processors  available  (p)  and  the  maximum  number  of  logical 
parallel  processes  (p*)  inherent  in  the  computation. 


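We then derived necessary and sufficient conditions, $\sum_{i=1}^{p} \frac{1}{i}\left(q_i - \frac{1}{p}\right) \ge 0$ and
$\sum_{i=1}^{p} q_i = 1$, under which:

(i) speedup, $S_p \le \min\left(\frac{p}{\ln p}, \frac{p^*}{\ln p^*}\right)$

(ii) execution time, $T_p \ge T_1 \cdot \max\left(\frac{\ln p}{p}, \frac{\ln p^*}{p^*}\right)$

(iii) efficiency, $E_p \le \min\left(\frac{1}{\ln p}, \frac{p^*}{p} \cdot \frac{1}{\ln p^*}\right)$

(iv) space-time product, $ST_p \ge T_1 \cdot \max\left(\ln p, \frac{p}{p^*} \ln p^*\right)$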
In each case, ln p is an approximation for the pth harmonic number $H_p$.
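How close the approximation is can be seen with a few lines of Python. This is an illustrative check, not part of the derivation, and the helper name `harmonic` is chosen here. Since Hp is always larger than ln p, replacing Hp by ln p makes the upper bound p/ln p slightly looser than p/Hp.

```python
import math


def harmonic(p):
    """Return the p-th harmonic number H_p = 1 + 1/2 + ... + 1/p."""
    return sum(1.0 / i for i in range(1, p + 1))


for p in (4, 16, 36, 64):
    h = harmonic(p)
    # Columns: p, H_p, ln p, speedup bound with H_p, speedup bound with ln p
    print(p, round(h, 3), round(math.log(p), 3),
          round(p / h, 2), round(p / math.log(p), 2))
```

The gap between Hp and ln p stays between 0.5 and 1 for every p in this range and shrinks slowly as p grows, so the two forms of the bound have the same order of growth.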

Despite the many different views that exist on the potential performance
improvements of parallel processor systems over uniprocessor systems, the
upper speedup bound, $S_p \le \min\left(\frac{p}{\ln p}, \frac{p^*}{\ln p^*}\right)$, has never been established
before. Furthermore, comparison with the extensive experimental results of
Kuck et al. indicates that the empirical speedups obtained are indeed $O(p^*/\ln p^*)$.
Figure 3: Sp Versus p (p* = 36)

The Attainable Speedup Regions, for a Given Computation:

(i) $S_p \le \min\left(\frac{p}{\ln p}, \frac{p^*}{\ln p^*}\right)$ iff $\sum_{i=1}^{p} \frac{1}{i}\left(q_i - \frac{1}{p}\right) \ge 0$

(ii) $S_p \le \min\left(p, \frac{p^*}{\ln p^*}\right)$ if $\sum_{i=1}^{p} \frac{1}{i}\left(q_i - \frac{1}{p}\right) < 0$ but $\sum_{i=1}^{p^*} \frac{1}{i}\left(r_i - \frac{1}{p^*}\right) \ge 0$

(iii) $S_p \le \min(p, p^*)$ otherwise

Sp* Versus p* (p = p*), with Kuck's experimental results

The Maximum Attainable Speedup Regions, for all Computations:

$S_{p^*} \le \frac{p^*}{\ln p^*}$ if $\sum_{i=1}^{p^*} \frac{1}{i}\left(r_i - \frac{1}{p^*}\right) \ge 0$

$S_{p^*} \le p^*$ otherwise

The Attainable Efficiency Regions for a given computation:

(ii) $E_p \le \min\left(1, \frac{p^*}{p \ln p^*}\right)$

(iii) $E_p \le \min\left(1, \frac{p^*}{p}\right)$